While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
Translated by Google Translate
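The interleaving of reasoning traces and actions described in the ReAct abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the toy knowledge base, the `search` tool, the `scripted_policy` stand-in for an LLM call, and the `Thought:` / `Action:` / `Observation:` line format are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy knowledge base standing in for an external source
# such as a Wikipedia API (an assumption for this sketch).
TOY_KB: Dict[str, str] = {
    "ReAct": "ReAct interleaves reasoning traces and actions.",
}


def search(query: str) -> str:
    """Toy action: look up an entity in the knowledge base."""
    return TOY_KB.get(query, "No entry found.")


def scripted_policy(trajectory: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call; returns (thought, action name, action argument)."""
    prefix = "Observation: "
    observations = [step for step in trajectory if step.startswith(prefix)]
    if not observations:
        return ("I should look up ReAct first.", "search", "ReAct")
    # Once an observation exists, finish with its content as the answer.
    return ("The observation answers the question.", "finish",
            observations[-1][len(prefix):])


def react_loop(
    question: str,
    policy: Callable[[List[str]], Tuple[str, str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Alternate thought -> action -> observation until the policy finishes."""
    trajectory = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(trajectory)
        trajectory.append(f"Thought: {thought}")
        trajectory.append(f"Action: {action}[{arg}]")
        if action == "finish":
            return arg, trajectory
        # The action's result is fed back so the next thought can use it.
        trajectory.append(f"Observation: {tools[action](arg)}")
    return "", trajectory  # step budget exhausted without an answer


answer, trace = react_loop("What does ReAct do?", scripted_policy, {"search": search})
```

Here `trace` records every thought, action, and observation in order, which is the kind of trajectory the abstract describes as more interpretable than output without reasoning traces.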
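Face images usually appear at a wide range of visual scales. Existing face representations handle scale variation with a multi-scale scheme that assembles a finite series of predefined scales. Such a multi-shot scheme adds an inference burden, and the predefined scales inevitably differ from those in real data. Learning the scale parameters from data and using them for single-shot feature inference is a better solution. To this end, we reform the conv layer by resorting to scale-space theory and obtain two benefits: 1) the conv layer learns a set of scales from the real data distribution, each of which is realized by a conv kernel; 2) the layer automatically highlights the feature at the proper channel and location corresponding to the scale of the input pattern and its presence. We then achieve hierarchical scale attention by stacking the reformed layers, building a novel architecture named Scale Attention Conv Neural Network (SCAN-CNN). We apply SCAN-CNN to the face recognition task and push the frontier of state-of-the-art performance. The accuracy gain is more pronounced when the face images are blurry. Meanwhile, as a single-shot scheme, its inference is more efficient than multi-shot fusion. A set of tools is provided to ensure fast training of SCAN-CNN and zero increase in inference cost compared with a plain CNN.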
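Sign language production (SLP) aims to automatically translate spoken language into sign sequences. The core process of SLP is to transform sign gloss sequences into their corresponding sign pose sequences (G2P). Most existing G2P models perform this conditional long-range generation in an autoregressive manner, which inevitably leads to an accumulation of errors. To address this issue, we propose a vector quantized diffusion method for conditional pose sequence generation, called PoseVQ-Diffusion, which is an iterative non-autoregressive method. Specifically, we first introduce a vector quantized variational autoencoder (Pose-VQVAE) model to represent a pose sequence as a sequence of latent codes. We then model the latent discrete space with an extension of the recently developed diffusion architecture. To better exploit spatio-temporal information, we introduce a novel architecture, CodeUnet, to generate higher-quality pose sequences in the discrete space. Moreover, taking advantage of the learned codes, we develop a novel sequential k-nearest-neighbours method to predict the variable lengths of the pose sequences for the corresponding gloss sequences. Consequently, compared with autoregressive G2P models, our model samples faster and produces significantly better results. Compared with previous non-autoregressive G2P methods, PoseVQ-Diffusion improves the predicted results through iterative refinement, achieving state-of-the-art results on the SLP evaluation benchmark.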
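Representing a pose sequence as a sequence of latent codes rests on vector quantization: each continuous vector is replaced by the index of its nearest codebook entry. The sketch below shows only that generic lookup step under simplifying assumptions; the two-dimensional frames, the three-entry codebook, and the function names are hypothetical, and it does not model the encoder, the diffusion process, or CodeUnet.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(codes: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Recover an approximate sequence by looking the codes up again."""
    return [list(codebook[k]) for k in codes]


# Hypothetical toy data: three codebook entries and three encoded frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

codes = quantize(frames, codebook)          # discrete latent codes
reconstruction = dequantize(codes, codebook)
```

A generative model can then operate on the integer `codes` rather than on continuous poses, which is what makes a discrete-space diffusion model applicable.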
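Recent sequential recommendation models increasingly rely on consecutive short-term user interaction sequences to model user interests. However, these approaches raise concerns about both short-term and long-term interests. (1) Short-term: an interaction sequence may not result from a single interest but from several intertwined interests, even within a short period, so these models fail to capture skip behaviors. (2) Long-term: interaction sequences are mostly observed sparsely at discrete intervals rather than continuously over the long run. This makes long-term interests hard to infer, because only discrete interest representations can be derived and interest dynamics across sequences are not taken into account. In this study, we address these concerns by learning (1) multi-scale representations of short-term interests and (2) dynamics-aware representations of long-term interests. To this end, we present an Interest Dynamics modeling framework using generative Neural Processes, coined IDNP, to model user interests from a functional perspective. IDNP learns a global interest function family and defines each user's long-term interest as a function instantiation, expressing interest dynamics through function continuity. Specifically, IDNP first encodes each user's short-term interactions into multi-scale representations, which are then summarized as the user context. By combining the latent global interest with the user context, IDNP reconstructs the long-term user interest function and predicts interactions at upcoming query timesteps. Moreover, IDNP can model such interest functions even when interaction sequences are limited and non-consecutive. Extensive experiments on four real-world datasets show that our model outperforms state-of-the-art methods on various evaluation metrics.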
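Deep neural networks (DNNs) have demonstrated strong performance in various domains. However, this raises social concerns when they are applied to sensitive domains involving the allocation of valuable resources, such as education, lending, and employment. Before DNNs can be reliably deployed in such sensitive domains, it is crucial to conduct fairness testing, i.e., to generate as many instances as possible that uncover fairness violations. However, existing testing methods remain limited in three respects: interpretability, performance, and generalizability. To overcome these challenges, we propose a new DNN fairness testing framework that differs from previous work in several key aspects: (1) interpretable: it quantitatively interprets the fairness violations behind a DNN's biased decisions; (2) effective: it uses the interpretation results to guide the generation of more diverse instances in less time; (3) generic: it can handle both structured and unstructured data. Extensive evaluations across 7 datasets and the corresponding DNNs demonstrate the superiority of the proposed framework. For example, on structured datasets, it generates many more instances (~×5.84) and saves more time (an average speedup of 534.56%) compared with state-of-the-art methods. In addition, the generated instances can be used to improve the fairness of biased DNNs, which helps build fairer and more trustworthy deep learning systems.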
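3D scene understanding is a relatively new research field. In this paper, we introduce the visual question answering task in 3D real-world scenes (VQA-3D), which aims to answer all possible questions about a given 3D scene. To tackle this problem, we propose the first VQA-3D dataset, CLEVR3D, which contains 60K questions across 1,129 real-world scenes. Specifically, we develop a question engine that leverages 3D scene graph structures to generate diverse reasoning questions, covering objects' attributes (i.e., size, color, and material) and their spatial relationships. Building on this dataset, we further design the first VQA-3D baseline model, TransVQA3D. The TransVQA3D model adopts a carefully designed Transformer architecture and achieves superior VQA-3D performance compared with a pure language baseline and with previous 3D reasoning methods applied directly to 3D scenes. Experimental results verify that using VQA-3D as an auxiliary task can improve 3D scene understanding, including scene graph analysis for node-wise classification and whole-graph recognition.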
Knowledge graph embedding (KGE), which maps entities and relations in a knowledge graph into continuous vector spaces, has achieved great success in predicting missing links in knowledge graphs. However, knowledge graphs often contain incomplete triples that are difficult for KGE models to infer inductively. To address this challenge, we resort to analogical inference and propose AnKGE, a novel and general self-supervised framework that enhances KGE models with analogical inference capability. We propose an analogical object retriever that retrieves appropriate analogical objects at the entity level, relation level, and triple level. In AnKGE, we train an analogy function for each level of analogical inference; it takes the original element embedding from a well-trained KGE model as input and outputs the analogical object embedding. To combine the inductive inference capability of the original KGE model with the analogical inference capability added by AnKGE, we interpolate the analogy score with the base model score and introduce adaptive weights in the score function for prediction. Through extensive experiments on the FB15k-237 and WN18RR datasets, we show that AnKGE achieves competitive results on the link prediction task and performs analogical inference well.
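The score interpolation mentioned in the AnKGE abstract can be pictured with a small sketch. The abstract does not give the exact score function, so the additive form, the fixed weights (the paper describes adaptive ones), the candidate names, and all numbers below are assumptions chosen only to illustrate how per-level analogy scores could shift a base-model ranking.

```python
from typing import Dict

LEVELS = ("entity", "relation", "triple")


def interpolate_score(base_score: float,
                      analogy_scores: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Assumed form: base score plus a weighted sum of per-level analogy scores."""
    return base_score + sum(weights[level] * analogy_scores[level] for level in LEVELS)


# Hypothetical candidate tail entities with made-up scores.
candidates = {
    "entity_a": {"base": 0.60,
                 "analogy": {"entity": 0.10, "relation": 0.20, "triple": 0.00}},
    "entity_b": {"base": 0.55,
                 "analogy": {"entity": 0.40, "relation": 0.30, "triple": 0.20}},
}

# Fixed weights for illustration only; in AnKGE the weights are adaptive.
weights = {"entity": 0.5, "relation": 0.3, "triple": 0.2}

final_scores = {
    name: interpolate_score(c["base"], c["analogy"], weights)
    for name, c in candidates.items()
}
ranked = sorted(final_scores, key=final_scores.get, reverse=True)
```

In this toy case the base model alone would rank `entity_a` first, while the interpolated score ranks `entity_b` first because its analogy scores are higher.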
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first use a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods overlook two issues that cannot be ignored: 1) Boundary-bias: the annotated target segment generally refers to two specific frames as its start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as the new boundaries. 2) Reasoning-bias: such incorrect new boundary frames also bias the reasoning during frame-query interaction, reducing the generalization ability of the model. To alleviate these limitations, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames that enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationships among these frames and to generate soft labels on the boundaries for more accurate frame-query reasoning. This mechanism can also supplement the sparsely sampled frames with the missing consecutive visual semantics for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
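The boundary-bias issue can be reproduced with a few lines of arithmetic. The sketch below assumes a simple uniform sampler that takes the middle frame of each equal-length chunk; the sampler, the video length, and the annotated boundaries are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick the middle frame of each of `num_samples` equal-length chunks."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def nearest_sampled(sampled: List[int], target: int) -> int:
    """Sampled frame closest to an annotated boundary frame."""
    return min(sampled, key=lambda idx: abs(idx - target))


def shifted_boundaries(sampled: List[int], start: int, end: int) -> Tuple[int, int]:
    """Boundaries the model actually sees after sparse sampling."""
    return nearest_sampled(sampled, start), nearest_sampled(sampled, end)


# Hypothetical video of 100 frames, downsampled to 8 frames.
sampled = uniform_sample_indices(num_frames=100, num_samples=8)

# Hypothetical annotation: the target segment spans frames 25 to 60.
annotated_start, annotated_end = 25, 60
new_start, new_end = shifted_boundaries(sampled, annotated_start, annotated_end)
```

Neither annotated boundary frame survives the sampling here, so the nearest sampled frames stand in for them and the segment the model reasons over is shorter than the annotated one.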
Normalizing flow is a class of deep generative models for efficient sampling and density estimation. In practice, the flow often appears as a chain of invertible neural network blocks; to facilitate training, existing works have regularized flow trajectories and designed special network architectures. The current paper develops a neural ODE flow network inspired by the Jordan-Kinderlehrer-Otto (JKO) scheme, which allows efficient block-wise training of the residual blocks and avoids inner loops of score matching or variational learning. As the JKO scheme unfolds the dynamics of the gradient flow, the proposed model naturally stacks residual network blocks one by one, reducing the memory load and the difficulty of end-to-end training of deep flow networks. We also develop an adaptive time reparameterization of the flow network with a progressive refinement of the trajectory in probability space, which improves training efficiency and accuracy in practice. Using numerical experiments with synthetic and real data, we show that the proposed JKO-iFlow model achieves similar or better performance in generating new samples compared with existing flow and diffusion models, at a significantly reduced computational and memory cost.
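For orientation, the classical JKO scheme is a time-discretized (proximal) form of gradient flow in the space of probability densities. In its standard textbook form, with step size $h > 0$, an energy functional $F$, and the 2-Wasserstein distance $W_2$, one step reads as follows. The symbols are the usual ones from the gradient-flow literature rather than notation taken from the abstract, and the specific functional used by JKO-iFlow is not stated there.

```latex
\rho_{k+1} \;=\; \operatorname*{arg\,min}_{\rho}
  \left\{ F(\rho) \;+\; \frac{1}{2h}\, W_2^{2}(\rho, \rho_k) \right\},
  \qquad k = 0, 1, 2, \dots
```

Each step depends only on the previous density $\rho_k$, which is consistent with the block-wise training described above: one residual block can be associated with one step and trained after the preceding blocks are fixed.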
Score-based diffusion models have captured widespread attention and fueled rapid progress in recent vision generative tasks. In this paper, we focus on the diffusion model backbone, which has been largely neglected so far. We systematically explore vision Transformers as diffusion learners for various generative tasks. With our improvements, the performance of a vanilla ViT-based backbone (IU-ViT) is boosted to be on par with traditional U-Net-based methods. We further offer a hypothesis on the implications of disentangling the generative backbone into an encoder-decoder structure, and show proof-of-concept experiments verifying the effectiveness of a stronger encoder for generative tasks with the ASymmetriC ENcoder Decoder (ASCEND). Our improvements achieve competitive results on CIFAR-10, CelebA, LSUN, CUB Bird, and large-resolution text-to-image tasks. To the best of our knowledge, we are the first to successfully train a single diffusion model on the text-to-image task beyond 64x64 resolution. We hope this will motivate people to rethink the modeling choices and training pipelines for diffusion-based generative models.